First of all, a new directory in the DNA High-Performance Computing Cluster was created, which is
/mnt/Timina/bioinfoII/amarin/RNAseq2/
This directory has the next organization:
|-data/ # Directory to download the raw data
|-data_trimmed/ # Directory to save the post-QC data
|-kallisto_out/ # Directory to save the kallisto outputs
|-QC/ # Directory to perform raw QC
|-QC_trimmed/ # Directory to perform QC on the trimmed data
|-scripts/ # Some scripts used to send jobs
The dataset choose to analysis has de GEO Accession GSE91061. Since downloading from the NCBI may be difficult and quite technical, it was choosen to use the European mirror, the European Nucleotide Archive from the EMBL-EBI. By browing on the GEO web page, we see that the data are from the BioProject PRJNA356761. We use that ID to search on ENA.
Due the fact that the are a lot of file needed to be downloaded, we use the web application SRA Explorer to get a script to get the bulk download.
We then select all
the files and click on “Add to collection”. After that, we see our saves
datasets and select the bash script for downloading FastQ files option.
We save the script with the name
downloadScript.sh, and
save it in the HPCC, in the //data/ directory.
Then, we send a job using the script. It lasted around 5 days.
Once the data is ready, we perform quality control. We just send a
job using fastqc with all the fastq files via a
wildcard:
fastqc /mnt/Timina/bioinfoII/amarin/RNAseq2/data/*fastq.gz -o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC
Then, we used multiqc to visualize all the samples at once:
multiqc /mnt/Timina/bioinfoII/amarin/RNAseq2/QC -o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC
This allowed us to see some irregularities. First, the first to nucleotides of each read had a smaller mean quality score, compared to the rest of the read.
Besides, in the heatmap, the base sequence sequence content seemed to be biased on the first base pairs of the read.
All the samples resembled more or less the next percentage of base content.
Nevertheless, MultiQC didn’t detect any adapter content. Based on that, we decided to trimm the first 12 nucleotides of each pair.
We send a job in the //data/ directory with a script
using the software Trimmomatic:
for i in *_1.fastq.gz;
do echo
trimmomatic PE -threads 40 -phred33 -trimlog trim.log \
$i "${i%_1.fastq.gz}_2.fastq.gz" \
../data_trimmed/"${i%_1.fastq.gz}_1_trimmed.fastq.gz" \
../data_trimmed/"${i%_1.fastq.gz}_1_unpaired.fastq.gz" \
../data_trimmed/"${i%_1.fastq.gz}_2_trimmed.fastq.gz" \
../data_trimmed/"${i%_1.fastq.gz}_2_unpaired.fastq.gz" \
HEADCROP:12
done
Here we just cut the first 12 nucleotides of each read and create a
trim.log log file by the use of 40 threads. We know that
the Phred score are based on ASCII 33 because previously we found a
# symbol on the reads.
This job took a couple of days.
Once the data was trimmed, we did one more time the quality control.
fastqc /mnt/Timina/bioinfoII/amarin/RNAseq2/data_trimmed/*fastq.gz \
-o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC_trimmed
multiqc /mnt/Timina/bioinfoII/amarin/RNAseq2/QC_trimmed \
-o /mnt/Timina/bioinfoII/amarin/RNAseq2/QC_trimmed
This time, the quality of the first pair bases was better.
Besides, the “per base sequence content” was less biased:
Each sample was more homogeneous.
Besides, other indicators, such as the “per base N content” and “per sequence GC content” performed better.
Once the data was ready, we decided to generate count matrices using Kallisto. To perform this, we first need a reference transcriptome. We choose the latest version available at GENCODE, which is the release 43. The file was downloaded at the parent directory of this project:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_43/gencode.v43.transcripts.fa.gz
Then, generated the kallisto index:
kallisto index -i /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/Hs_ref.kidx \
/mnt/Timina/bioinfoII/amarin/RNAseq2/gencode.v43.transcripts.fa.gz
Then, we used a bash script to perform the quantification with all the samples.
for file in /mnt/Timina/bioinfoII/amarin/RNAseq2/data_trimmed/*_1_trimmed.fastq.gz
do
op=$(basename $file | cut -c-10 )
file2=$(echo $file | sed 's/_1_/_2_/')
mkdir /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/${op}
kallisto quant -i /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/Hs_ref.kidx \
-o /mnt/Timina/bioinfoII/amarin/RNAseq2/kallisto_out/${op} \
--threads 40 $file $file2
done
For the differential expression analysis we used DESeq2, a Bioconductor package running on R 4.3.0 using a PC. The next, are the libraries to use.
library(plotly)
library(tximport)
library(tidyverse)
library(DESeq2)
library(ggplot2)
library(ggrepel)
library(rhdf5)
library(GenomicFeatures)
library(dplyr)
library(biomaRt)
library(gprofiler2)
library(ComplexHeatmap)
library(genefilter)
library(clusterProfiler)
library(AnnotationDbi)
library(org.Hs.eg.db)
#library(ggupset)
library(enrichplot)
First, we imported the kallisto quantification tables to a local computer from the DNA HPCC. Then, we created a tx2gene table, which maps annotated genes with transcripts. To do this, we used the Bioconductor package biomaRt.
ensembl = useEnsembl(biomart="ensembl", dataset="hsapiens_gene_ensembl")
tx2gene = getBM(attributes=c('ensembl_transcript_id','ensembl_gene_id'),
mart = ensembl)
Next, we import the quantification table to the R environment. To do this, we use the Bioconductor package rhdf5.
## Paths to h5 archives
files = file.path(list.dirs("./kallisto_data", recursive = FALSE),
"abundance.h5")
names(files) = str_extract(files, "SRR\\d+")
## Importation
txi.kallisto.tsv = tximport(files, type = "kallisto", tx2gene = tx2gene,
ignoreAfterBar = TRUE, ignoreTxVersion=TRUE)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109
## summarizing abundance
## summarizing counts
## summarizing length
Next, we create a metadata table to know the treatment for each SRA sample. Since it wasn’t available on SRA Run Selector web page, we needed to manually create it.
## Metadata importation to R
meta = read.csv("./metadata.csv", header = FALSE)
samples = column_to_rownames(meta, var = "V1")
Now, we create a DESeq2 dataset-like object, called DESeqDataSet, using the imported counts, sample names and metadata table.
dds = DESeqDataSetFromTximport(txi.kallisto.tsv,
colData = samples,
design = ~V2)
## using counts and average transcript lengths from tximport
Before doing some exploration analysis, we pre-filter low counts.
## Pre-filtering
keep = rowSums(counts(dds)) >= 10
dds = dds[keep, ]
## Factor level
dds$V2 = factor(dds$V2, levels = c("Pre", "On"))
To identify if there’s a batch effect, we graph a PCA; normalizing the data with the variance stabilizing transformation.
vst = vst(dds, blind = FALSE)
Once we hve the normalized data, we graph a tridimensional and interactive PCA plot.
pca = prcomp(t(assay(vst)), rank. = 3)
components = pca[["x"]]
components = data.frame(components)
var = summary(pca)[["importance"]]['Proportion of Variance',]
var = 100*sum(var)
fig = plot_ly(components, x = ~PC1, y = ~PC2, z = ~PC3, color = dds$V2)
fig = fig %>% layout(title = "PCA")
fig
As we can observe, there is not a visible cluster suggesting a batch effect. In fact, by some reason, almos all the samples cluster into a big cloud, regardless to their treatment.
Once we know there are not batch effects in our data, we perform the Differential Expression Analysis.
dds = DESeq(dds)
To consider significant genes, we will use a significance level of \(\alpha=0.01\), and a log fold change expression greater than 1 or lower than -1.
res = results(dds, alpha = 0.01)
## Significative genes
resSig = subset(res, padj < 0.01)
dim(resSig)
## [1] 80 6
We have 80 significant genes. We can visualize them in a volcano plot.
res.df = as.data.frame(res)
## Add genes symbols
res.df$symbols = mapIds(org.Hs.eg.db, keys = rownames(res.df),
keytype = "ENSEMBL", column = "SYMBOL")
### Up and down genes
res.df$diffexpressed = "NO"
res.df$diffexpressed[res.df$log2FoldChange > 1 & res.df$padj < 0.01] <- "UP"
res.df$diffexpressed[res.df$log2FoldChange < -1 & res.df$padj < 0.01] <- "DOWN"
ggplot(data = res.df, aes(x=log2FoldChange, y=-log10(padj), col=diffexpressed, label=symbols)) +
geom_point()
Once we have identified the significant genes, we extract them in tsv files. One for the up-regulated ones, and another for the down-regulated ones.
## Up genes
resSigUP = subset(resSig, log2FoldChange >= 1)
resSigUP = cbind(rownames(resSigUP), resSigUP) # Converts rownames into first column
colnames(resSigUP)[1] = "GENE_ID" # Adds name to first column
write.table(resSigUP, file = "./DEG_results/UP_genes.tsv", quote = FALSE,
sep = "\t",row.names = FALSE)
## Down genes
resSigDOWN = subset(resSig, log2FoldChange <= -1)
resSigDOWN = cbind(rownames(resSigDOWN), resSigDOWN)
colnames(resSigDOWN)[1] = "GENE_ID"
write.table(resSigDOWN, file = "./DEG_results/DOWN_genes.tsv", quote = FALSE,
sep = "\t",row.names = FALSE)
We got 6 significant downregulated genes and 73 upregulated significant genes.
To make the GO terms enrichment, we use the org.Hs.eg.db
database, containing the annotated genes from Homo sapiens. The
ontology was made using the Biological Process (BP).
genes = rownames(resSig[resSig$log2FoldChange >= 1, ])
GO = enrichGO(gene = genes,
OrgDb = org.Hs.eg.db,
keyType = "ENSEMBL",
ont = "BP")
GO.df = as.data.frame(GO)
We can visualize the enriched genes depending on their biological process.
GO2 = setReadable(GO, OrgDb = org.Hs.eg.db)
cnetplot(GO2, node_label = "all",color_category='firebrick',
color_gene='steelblue', showCategory = 10)
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Pop!_OS 22.04 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/Mexico_City
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] enrichplot_1.20.0 org.Hs.eg.db_3.17.0
## [3] clusterProfiler_4.8.1 genefilter_1.82.1
## [5] ComplexHeatmap_2.16.0 gprofiler2_0.2.1
## [7] biomaRt_2.56.0 GenomicFeatures_1.52.0
## [9] AnnotationDbi_1.62.1 rhdf5_2.44.0
## [11] ggrepel_0.9.3 DESeq2_1.40.1
## [13] SummarizedExperiment_1.30.1 Biobase_2.60.0
## [15] MatrixGenerics_1.12.0 matrixStats_0.63.0
## [17] GenomicRanges_1.52.0 GenomeInfoDb_1.36.0
## [19] IRanges_2.34.0 S4Vectors_0.38.1
## [21] BiocGenerics_0.46.0 lubridate_1.9.2
## [23] forcats_1.0.0 stringr_1.5.0
## [25] dplyr_1.1.0 purrr_1.0.1
## [27] readr_2.1.4 tidyr_1.3.0
## [29] tibble_3.1.8 tidyverse_2.0.0
## [31] tximport_1.28.0 plotly_4.10.1
## [33] ggplot2_3.4.1
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.0 BiocIO_1.10.0 bitops_1.0-7
## [4] ggplotify_0.1.0 filelock_1.0.2 polyclip_1.10-4
## [7] XML_3.99-0.14 lifecycle_1.0.3 doParallel_1.0.17
## [10] lattice_0.21-8 MASS_7.3-59 crosstalk_1.2.0
## [13] magrittr_2.0.3 sass_0.4.5 rmarkdown_2.20
## [16] jquerylib_0.1.4 yaml_2.3.7 cowplot_1.1.1
## [19] DBI_1.1.3 RColorBrewer_1.1-3 zlibbioc_1.46.0
## [22] ggraph_2.1.0 RCurl_1.98-1.12 yulab.utils_0.0.6
## [25] tweenr_2.0.2 rappdirs_0.3.3 circlize_0.4.15
## [28] GenomeInfoDbData_1.2.10 tidytree_0.4.2 annotate_1.78.0
## [31] codetools_0.2-19 DelayedArray_0.26.2 DOSE_3.26.1
## [34] xml2_1.3.3 ggforce_0.4.1 tidyselect_1.2.0
## [37] shape_1.4.6 aplot_0.1.10 farver_2.1.1
## [40] viridis_0.6.3 BiocFileCache_2.8.0 GenomicAlignments_1.36.0
## [43] jsonlite_1.8.4 GetoptLong_1.0.5 ellipsis_0.3.2
## [46] tidygraph_1.2.3 survival_3.5-5 iterators_1.0.14
## [49] foreach_1.5.2 tools_4.3.0 progress_1.2.2
## [52] treeio_1.24.0 Rcpp_1.0.10 glue_1.6.2
## [55] gridExtra_2.3 xfun_0.39 qvalue_2.32.0
## [58] withr_2.5.0 fastmap_1.1.0 rhdf5filters_1.12.1
## [61] fansi_1.0.4 digest_0.6.31 timechange_0.2.0
## [64] R6_2.5.1 gridGraphics_0.5-1 colorspace_2.1-0
## [67] GO.db_3.17.0 RSQLite_2.3.1 utf8_1.2.3
## [70] generics_0.1.3 data.table_1.14.8 rtracklayer_1.60.0
## [73] prettyunits_1.1.1 graphlayouts_1.0.0 httr_1.4.4
## [76] htmlwidgets_1.6.2 S4Arrays_1.0.4 scatterpie_0.1.9
## [79] pkgconfig_2.0.3 gtable_0.3.1 blob_1.2.3
## [82] XVector_0.40.0 shadowtext_0.1.2 htmltools_0.5.4
## [85] fgsea_1.26.0 clue_0.3-64 scales_1.2.1
## [88] png_0.1-8 ggfun_0.0.9 knitr_1.42
## [91] rstudioapi_0.14 tzdb_0.3.0 reshape2_1.4.4
## [94] rjson_0.2.21 nlme_3.1-162 curl_5.0.0
## [97] cachem_1.0.6 GlobalOptions_0.1.2 parallel_4.3.0
## [100] HDO.db_0.99.1 restfulr_0.0.15 pillar_1.8.1
## [103] vctrs_0.5.2 dbplyr_2.3.0 xtable_1.8-4
## [106] cluster_2.1.4 evaluate_0.20 cli_3.6.0
## [109] locfit_1.5-9.7 compiler_4.3.0 Rsamtools_2.16.0
## [112] rlang_1.0.6 crayon_1.5.2 labeling_0.4.2
## [115] plyr_1.8.8 stringi_1.7.12 viridisLite_0.4.1
## [118] BiocParallel_1.34.1 assertthat_0.2.1 munsell_0.5.0
## [121] Biostrings_2.68.1 lazyeval_0.2.2 GOSemSim_2.26.0
## [124] Matrix_1.5-1 patchwork_1.1.2 hms_1.1.2
## [127] bit64_4.0.5 Rhdf5lib_1.22.0 KEGGREST_1.40.0
## [130] highr_0.10 igraph_1.4.3 memoise_2.0.1
## [133] bslib_0.4.2 ggtree_3.8.0 fastmatch_1.1-3
## [136] bit_4.0.5 downloader_0.4 gson_0.1.0
## [139] ape_5.7-1